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1. Introduction 

The importance of electronic computers in manifold branches of 
science and technology has continued to grow since the development 
and induction of the first German electronic computer in the 
Astrophysics Division of the Max Planck Institute for Physics in 
Goettingen. Although most computers have been installed for 
purely commercial reasons, the largest and fastest computers have 
been and still are being used primarily in the solution of various 
physical problems. 

The rapid development of semiconductor and switching circuit 
technologies and of signal communication techniques has resulted 
in substantial improvements in computation times, mainframe sizes 
and access times for central processing units (CPUs). In fact, 
the performance of computers has been enhanced so much that for 
some ten years now serially operating high performance mainframes 
have been on the market. Such machines can easily perform one 
million computations per second, vide the CDC 7600 in 1969, the 
IBM 360 195 in 1971 and the AMDAHL 470/VC in 1976. 

During this same period, increasing numbers of physicists have 
immersed themselves in a third branch of physics which came into 
being with the advent of the computer. Thus, in addition to the 
domains of experimental and theoretical physics, there is now a 
discipline known as "computational physics". This field involves 
the transformation of complex physical problems into equations - 
often in idealised form - which are solved approximately on the 
computer using numerical algorithms. Physicists are much interes- 
ted to discover how the approximate solution of a problem will 
change when a variety of different parameters are varied. By 
such techniques it is possible nowadays to save the millions of 
marks which extremely expensive test units would entail. One 
typical instance involves the numerical solution of magnetohydro- 
dynamic differential equations used to determine stable plasma 
equilibria at the Max Planck Institute for Plasma Physics in Garching 
near Munich. These studies are providirjg insights which will enable 
fusion reactors to be built in the future. Another instance involves 
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the numerical solution of Navier-Stokes differential equations to 
determine the physical properties of the flow around objects having 
specific configurations, which is being carried out at the Institute 
for Theoretical Fluid Mechanics within the DFVLR (German Research 
and Test Centre for Air and Space Travel) at Goettingen. Computations 
of this type make possible a reduction in the number of very costly 
wind tunnel tests which need to be performed. 

The other side of the coin is that the huge strides made in these 
fields over the past two decades - which are due primarily to techno- 
logical advances - have also substantially raised the demands now being 
made as to what computers might be expected to accomplish. 

Since little improvement in the performance of serially operating 
computers, i.e. so-called von Neumann computers, has been achieved 
over the past ten years, several computer companies in the seventies 
started to interest themselves in new architectures. Machines known 
as vector and pipeline computers have been on the market since 1976. 
These include the CRAY 1/lS, the STAR 100 and the CYBER 203/205. Para- 
llel computers represent another group and significant examples are 
the ILLIAC IV from Burroughs, the DAP from ICL and the HEP from 
Denelcor. 

We shall concern ourselves here exclusively with the latest systems 
from CDC, CRAY and ICL, namely with the mainframes CYBER 205, CRAY IS 
and DAP. After a brief outline in Chapter 2 of the principal differ- 
ences of the individual machines as compared to serial computers. 
Chapter 3 will be devoted to the languages CRAY FORTRAN, CYBER 205 
FORTRAN and DAP FORTRAN and reference will be made to a simple yet 
typical problem for many applications, viz. the addition of two data 
subsets. In Chapter 4 we consider the numerical solution of modified 
magnerohydrodynamic differential equations in two dimensions using the 
easily programmable single step procedure which works very well on all 
three systems. This procedure may be adapted without any special 
problems as a rapid super step procedure. Extracts of the computer 
program are discussed for all three computers. Our concluding Chapter 
5 presents a survey and, in the opinion of the author, a brief overview 
of the development of computer systems and programming languages up to 
the end of the eighties. 
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At this point we should like to express our sincere thanks o those 
companies and institutions involved in any way in the appearance of 
this report. In parti cuVar, we should like to thank Dr. SchSfer of 
CDCj Mr. R. dbelmesser of CRAY, Mr. W. Erhard of the University of 
Erlangen-Nuremburg and Mr. Sutherland of ICL. Thanks are also due 
to Dr. MUller-Wichards of the DFVLR in Goettingen for his valuable 
comments. 

2. Vector and Parallel Computers 

The most important arithmetical instructions, such as addition and 
multiplication, are carried out in several steps by a conventional 
computer. For instance, floating point addition comprises the follow- 
ing steps (see Figure 1): 

The loading of two elements 

Comparison of the exponents 

Transfer of the mantissas 

Addition of the mantissas 

Normalisation to floating point normal format 

Storage of the result. 

For the serial processing of two vectors by a von Neumann computer, the 
floating point addition in the above case would require four time 
cycles for each pair of elements plus the time taken for the loading 
and storing. For a machine having a processing cycle time of (say) 

32.5 nsec (such as the AMDAHL 470) and a main store cycle time of 325 
nsec, we obtain in very simplified terms 

4*32.5 + 2*325 = 780 nsec. 

This corresponds to a performance of 1.3 MFLOPS (Million FLOATING 
POINT Operations per Second) in our example. If two vectors are 
added, 83% of the time taken for a pair of elements is for loading 
and storage operations and only 17% for the actual addition. 

However, if addition of two vectors from an addition pipe is performed 
on a vector computer, there will be in each segment of the pipe a pair 
of operands in a differing state of processing. After a certain 
start-up time (in our example 325 nsec), a result is produced after 





• •• 



• • • • • 
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22 
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Figure 1. The addition of pairs of numbers = x^. + , 

i = 1,2, . . ,N on scalar, vector and parallel 
computers 
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each processing cycle. In the above case, for a vector length of 64 
we obtain 

325 + 4*32.5 + 63*32.5 + 325 - 2827.5 nsec 

which is equivalent to 

44.18 nsec « 22.6 MFLOPS 

for a single addition. The percentage involving loading and storage 
is now reduced to 23%. 

The operating principle is analogous to the conveyer belt principle 
used for filling beer bottles in a brewery. While empty bottles 
enter the "pipe" at one end of the belt, full beer bottles emerge at 
the other end ready for transportation. Between the starting and 
finishing points there are at the same time numerous other bottles 
in varying states of processing (in the segments). 

Thus, in pipelining a stream r;f operand pairs flow through the pipe. 

In this way, the CRAY IS can attain up to 23 MFLOPS for vectors whose 
length is a multiple of 64. The CYBER 205 has a performance of just 
40 MFLOPS for addition of vectors of length 64, whereas for vectors 
of length 1000 even 88 MFLOPS can be achieved on a two pipe version 
using 64-bit arithmetic. The reasons for this disparity will be 
investigated in the next chapter. Such performance data for special, 
individual operations tell us nothing about the overall performance 
of the particular computers during the operation of extensive produc- 
tion programs found in practice. 

Another feasible extension of serial von Neumann computers would be 
the linking of many such computers to a giant mainframe, though the 
problems arising therefrom are all too evident. In addition to the 
high financial expenditure (the flow model processor developed for 
NASA by Burroughs cost some 100 million dollars), difficult communica- 
tion problems between the units would have to be solved. Moreover, 
the interchange of data and control information would be expensive in 
terms of the time and hardware involved. 

A more sensible approach might be the linking of many smaller pro- 
cessors operating under one single control unit. Each of these 
processors would then carry out simultaneously the same computations. 
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The maximum performance of such parallel computers is proportional 
to the number of processors; in the case of DAP from ICL this yields 
a factor of 64 x 64 = 4096. Although a single processor in the DAP 
requires 135 ysec to perform an addition (=^0.0074 MFLOPS), the DAP 
performance for the addition of two 64 x 64 matrices is 30 MFLOPS. 

A compromise was made for the mutual communication between the pro- 
cessors. Each processor is directly linked to its four neighbouring 
processors, in particular to their 4096-bit memories. If data are 
required from the memory of a processor further away, they have to 
be sent via so-called highways (see Figure 6). 

3o Adding Data Subsets on the Three Systems 

By means of a simple example we now propose to present a brief review 
of the three programming languages CRAY FORTRAN, CDC FORTRAN and DAP 
FORTRAN. This will automatically involve mention of the three archi- 
tectures used. 

Suppose that a scientific problem in the form of a set of differential 
equations (as, for instance, in Chapter 4) cannot be solved in closed 
form using the methods of analytical mathematics. It is often suffic- 
ient for an approximate solution to be obtained which gives numerical 
values to certain points within the region of interest. This type of 
approach is not uncommon in everyday life. If an oil deposit is to 
be determined, for instance, the surveying engineers often manage with 
(very costly) drillings made at intervals of 5 km. 

The solution of many problems in nature are thus approximated by making 
use of lattices. If possible, the probable error arising from such a 
"discrete" solution is also given. We are interested here in the solu- 
tion of a problem on an N*N lattice with N^ lattice or nodal points 
(i,j) with 1 ^ i < N and 1 i j ^ N as follows: 
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Figure 2. The region for the calculations 
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The solution is known aloig the edges of the lattice i = 1, i = N, 
j = 1 and j = N, so that computations need be performed only for 
points within the lattice. For instance, the deflection of a plate 
firmly fastened along its edges is known. The addition of two test 
series A and B would then be carried out for the subsets 

A{2,2) A(2,N-1) 

I I 

I I 

A(N-1,2) - - A(N-l.N-l) 
and 

B(2,2) B(2,N-1) 

I I 

I I 

B(N-1,2) - - B(N-1,N-1) 

To solve this problem on a serially operating computer, it first has 
to be translated into a programming language. In FORTRAN, the most 
coniiiictivly used language in the world for scientific and technical 
problems, the program would assume the following form: 

DIHENSIOU A(N,N), C(N,N) 

N1 » N-1 
READ (3,10) A,B 
DO 1 J => 2,N1 

DO 1 I a 2,N1 

1 C(I,J) » A(I,J) + B(I,J) 

WRITE (6,10) C 
10 FORMAT (8.5) 

STOP 

END 

Initially, therefore, space in the computer memory of dimension 3*N2 
should be reserved for the fields A, B and C, where N is a fixed whole 
number. After reading in A and B and assigning them to their reserved 
fields, the addition is performed by columns starting from the lattice 
point (2,2) and the result returned to memory as C. Finally, C is 
expressed in a fixed format. 
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There are no special problems associated with the, programming of 
this for the user of the CRAY IS, He need not translate his program 
into some quite different language. Apart from a few extensions, 

CRAY FORTRAN is virtually identical to normal FORTRAN. The CRAY 
FORTRAN compiler can read any normal FORTRAN program and automati- 
cally vectorises simple DO loops. For multiple DO loops, as in our 
example, it vectorises in each case only the innermost loop. This 
is primarily because the CRAY IS is a computer which carries out 
vector operations from register to register (see Figure 3). In 
each of eight vector registers there is space for 64 words from 
the memory (v/here they need not be stored as continuous strings). 

The vector elements move out of the register directly through the 
corresponding functional units such as addition, multiplication, etc. 
When 64 elements have been worked through, the relevant register is 
empty and it is then filled again with 64 elements. At this point 
the vector operation starts from the beginning once more. Even 
though the automatic switching to and fro of the data between the 
memory and the register is very convenient, it consumes a great deal 
of time (up to 76$^). Accordingly, the CRAY IS attains a maximal 
performance of 23 MFLOPS for the addition in our problem. If data 
fields are used for differing computations one after the other, the 
intermediate results should not go back to the memory but can be 
further processed directly. In this way the CRAY IS achieves, for 
instance, for the operation 

(x - y) * (x + y), x,y are vectors 
a performance of over 50 MFLOPS. 

To solve the above problem with the greatest possible efficiency on 
the CYBER 205, the program has to be written in CDC FORTRAN roughly 
as follows: 

DIMENSION A(N,N). B(N,N), C(N.N) 

DESCRIPTOR AD, BD, CD, BITD 
N12 => N * (N-2) 

BIT, BITD, BIT (N 12) 

ASSIGN BITD, BIT(1;N12) 

■BITD = Q8VMKZ(2,N;BITD) 

.ASSIGN AD, A( N ,1 ;N12) 

ASSIGN BD, B(N,1;N12) 

ASSIGN CD, C(N,1 ;N12) 


L2 



The symbols have the following significance: 

V = the vector register 
S = the scalar register 
A = the address register 

B = the fast address register (as intermediate memory) 
T = the fast scalar register (as intermediate memory) 


Figure 3. Simplified version of the CRAY 1 configuration 
















13 


As the CYBER 205 is a computer which operates from memory store to 
memory store (see Figure 4), it will reach its best performance for 
long vectors (from N > 200). The latter should be registered in 
the memory store in the most continuous manner possible. The DES- 
SCRIPTOR AD indicates a location in the memory where the field A(N,N) 
is stored in columns as an vector. The information comes from 
the associated ASSIGN instruction in line 7 of the program. Since 
only the test values are required for the inner lattice points, the 
construction of a control vector BIT becomes necessary. The elements 
of this vector are 0 or 1 BIT depending respectively on whether the 
lattice point in question is an edge point or an inner point. The 
appropriate Assembler routine 

Q8VMKZ (2,N;BITD) 

constructs the following BIT vector 

001 . .. 1001. ..10 01. ...1 

^ > ' 

N N N-1 

consisting in all of N*(N-2) elements. If the descriptor AD indi- 
cates the element (N,l) of the last line in the first column (which 
still belongs to the edge), the addition will begin only two elements 
further along because of BIT, i.e. addition begins with the element 
(2,2). The addition can now be carried out under the control of 
this BIT vector by means of the Assembler routine 

CALL Q8ADDNV (, ,AD, ,BD,BITD,CD) 

Without the BIT vector only vectors of length N could be added. 

This would give in the case of N = 52, for instance, a performance 
of around 33 MFLOPS for the CYBER 205 (2 pipe version; 64-bit arith- 
metic). BIT controlled addition with vectors of length N12 = 48*50 
= 2400 enables the CYBER 205 to attain around 95 MFLOPS. It is 
thus the prime task of the CYBER 205 programmer to arrange his data 
in continuous vectors. Assistance in the solution of this and 
similar problems is provided by CDC FORTRAN which contains some 
250 novel instructions. 

Although the CRAY IS operates only in parallel, when vectors are 
subjected to various computer operations, i.e. are. channelled 
through various functional units, just the basic operations of 
addition and multiplication of vectors are carried out in parallel 
in 2 or 4 pipes in the case of the CYBER 205. Thus it is that, for 


instance, the vector elements with uneven location numbers in the 
first pipe and those with uneven location numbers in the second 
pipe are processed. 
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Figure 4. Configuration of the CYBER 205 

In contrast with the so-called vector computers from CRAY and CDC, 
the DAP (Distributed Array processor) from ICL is a parallel computer. 
This performs operations for arrays, e.g. 64*64 fields, in parallel 
on 64*64 identical processors (so-called processing elements) at the 
BIT level. However, this necessitates a totally new approach which 
is far removed from that for serial operations used especially in 
FORTRAN programs. Since many efficient algorithms contain purely 
serial structures (iterations, recursions, element comparisons, 
intermediate enquiries, etc.), they need to be restructured or even 
rejected and new ones developed. The above example assumes roughly 
the following form in DAP FORTRAN for N ^ 64: 

C0MM0N/ADD1/N,A(64,64) ,B(64-,64) ,C (64,64) 

READ (3,10) A,B 
CALL ADD 
WRITE (6,10) C 
10 FORlUVr (8.5) 


STOP 

END 







SUBROUTIHE ADD 

COMMON/ADD1 /N r A ( , ) #JB ( , ) ,C ( , ) 

LOGICAL INTERIOR (,) 

CONVENE (A) 

CONVENE (B) 

CONVFSI (N) 

INTERIOR ■ ROWS(2,N-1) .AND. COLS(2,N-1) 

C( INTERIOR) ■ A+B 
CONVMPE (C) 

RETURN 

END 

A DAP program always consists of two parts; the host part, e.g. for 
an ICL 2980, and the DAP part. In the host part endeavours should 
be made to include all the scalar operations, the input/output and 
the overall controls for the program. Those parts of the program 
requiring the longest computing times should be assigned to the 
DAP part as sub-programs. 

The link between the host and DAP parts is established by means of 
a CALL instruction. Data need not be transferred since they are 
entered by level at the outset (see Figure 5). When required for 
use in the DAP, they are first converted column-wise into the 
individual 4096-bit memories. CONVFME(A) thus me?iins here: convert 
A from the 2900 FORTRAN memory mode into the matrix mode of the 
DAP as a REAL(=i£) value. 

Since we do not wish to carry out the addition over the whole 64*64 
field, we construct a logical matrix INTERIOR composed of the 
entries TRUE (=i 1-BIT) for the points (i,j) of the logical inter- 
section 

{2 S i S N-1> n {2 S j S N-1} 

Whenever the conditions 

i = 1 or j = 1 
N < i < 64 or N < j <64 

hold, the entry FALSE (- 0-BIT) is made. The addition is thereby 
carried out everywhere, but transferred to C only for the chosen 
inner elements. As the result is to be printed out in the host 
part, C is finally converted from the DAP store mode to the 2900 
mode . 
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In DAP FORTRAN too there are numerous extensions to the language 
which make it possible for a programmer to write programs in a 
very straightforward manner. The approach adopted is very similar 
to that for CDC FORTRAN, though in the case of the DAP problems 
need to be formulated in a novel parallel structure. To balance 
this, however, a performance of up to 30 MFLOPS is attained for 
the above example of matrix addition when the 4096 processors are 
fully employed, in spite of the fact that the individual processors 
are relatively slow. 


OAP 



Figure 5. The DAP as part of the 2900 computer 
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Figure 6. Diagram of an individual processor in the DAP 
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4, The Numerical Solution of MHD Equat ~^ 6ns 

In order that an even deeper insight may be afforded into the three 
programming languages CRAY FORTRAN, CDC FORTRAN and DAP FORTRAN, we 
shall discuss in some detail in this chapter the numerical solution 
of non-linear, magneto hydrodynamic differential equations. These 
equations are solved in a cylinder of square cross-section assuming 
that the ideal gas law holds, that changes of state are isentropic 
and that the heat conductivity, viscosity and electrical resistance 
can be neglected. It is then possible to describe a plasma by means 
of the dynamic equations below: 

3v 

+ (v»grad)v • -grad p ♦ ^ x B 
3B 

(4.1) ^ • rot X B) 

m -v*grad p - p*dlv v 


with 0 ^ “ rot ^ and div ^ = 0. Here signifies the current density, 
^ the magnetic field, ^ the velocity and p the pressure of the plasma 
at the point (x,y,z) at time t. The significance of the individual 
equations is explained, for instance, in {82}. Static solutions, 
i.e. those representing equilibria for the system, are of especial 
interest. 

It has been shown in {83} that for static solutions the left side of 
the first equation, i.e. the so-called inertial terms 


(4.2) 


^ + (v • grad) v 
3t 


can be replaced by the frictional term y^. The improved stability of 
the system then formed is balanced by the less tractable parabolic 
equations which have to be solved numerically. The original hyper- 
bolic equations are better behaved, though this disadvantage is 
more or less overcome by the use of implicit procedures or of the 
rapid, explicit super step procedure {84}. 

In what follows we shall restrict ourselves to two dimensions, i.e. 
concentrate on the numerical solution of the problem in the square 
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cross-sectional area of the cylinder 



The MHD equations (4.1) then assume the form: 

• ■re 

VX - - H - BY • JZ 


(4.3) 


VY 


- - + 


BX • JZ 


^ BX » (VX‘BY - VY*BX) 

Tt (VX*BY - VY-BX) 

^ P - -(VX H * VV-|E) - p(^ VX + ^ VY) . 

The initial and edge conditions are derived f . jm the following flow 
function (see {82}); 

I|»(x,y) ■ -~r sin ir(x-1) • sin ir<y-1) , 

4ir^ 


which also provides the exact solution to the problem in (4.1) or 
(4.3). The specific initial conditions for t = 0 are: 

2 2 

p (x,y) ■ ir » t|i 


BX(x,y) ■ “ ly 
BY(x,y) « 1^ 


and the edge conditions for t > 0 are: 


where (n) indicates in each case the normal components of the 
vectors B and y. 

Discretising of the individual parameters is now carried out at the 
following points of the cell (I,J): 
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VX(l+i,J) VX(I,J) 

VY(I+1,J) VY(1,J) 



Figure 7 

Thus, even for the discrete case, the laws of conservation of flow 
and mass still hold. 


We shall now turn to the detailed discussion of the computer program 
in the three different languages CRAY FORTRAN, CDC FORTRAN and DAP 
FORTRAN. The main program has the same form on all three computers: 


cc 

cc 

cc 

cc 


EXPLICIT SINGLE STEP PROCEDURE 
WITH CONSTANT STEP WIDTH 


cc 

cc 

cc 

cc 


cccccccccccccccccccccccccccccccccccccccccccccccccccccc 
c 

c ■ program einzel 

C RXfRY X-.OR. y-COMPONENTS OF B-FICUO, 

C VX*VV X-.OR Y-COMPONENT^ Of VCtOClTY FietD 

C P*PX*PY PRe$$URC. X- UNO Y-COMPONEWTS PF GRADIENT 

0002 REAL BX(62f62)*BY(62t62>«VX(62»62)*VY(62»62)« 

* Pi^ 2 * 62 ) 

0003 COMMON /MAIN/ BX»BY«VXf VYf P 

0004 COMMON /CONI/ N«N1 •N2«L«N11 f NN*LL 

0005 COMMON /CONR/ SDT*DTtDX2«DX«EPS,SV2 

C RCADiNfi IN PARAHETERR: 

C FieuD SIZE ■ 5*M, number. OF ITERATIONS • LL» 

C CUTOFP eftfioR* EPSt TIME STEP « DT 


0006 


READ(5«») m«ll*eps«dt 

0007 


WRITEI6.33) M*LL*EP5fOT 

0008 

33 

format ( 1H1«10X**M B*«l6«/«llX»<Lt 



•El3,6*/*llXt»0T ■•♦F8.4) 

OOOR 


N«5*M 

0010 


N1«N^1 

0011 


N2*N*2 

0012 


NMaN*N 

0013 


NllaNlaul 

0014 


OX*l . /float IN) 

0015 


DX?*DX*nx 
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001 *s 

0017 

OOIfl 

00l‘» 

00?0 

0031 

fl0?2 

0033 

0034 

0035 
003ft 
0037 
003fl 
OOP** 
0010 
0031 
0033 

0033 

0034 

0035 
003ft 
0037 


00*0 


0039 

0040 

0041 

0042 
004.3 
0044 


0045 

004IS 


I CALCULATION OF THE INITIAL VALUES: 

PT*3.l4l59?ftS3S90 

O5I0»0,ft?533 

PlsPI«peTO 

P3*0.2S*PSIO 

r>3»0.S*r(X*oT 

00 1 J*1»N3 

00 1 I«l*N2 

X*<FL04T( l-u •OX-1, 1*PI 

Y« «' FLOAT (J-U *0X-1 , ) •PI 

X?*X-02 

Y2*V-0? 

a*(f,Jl *-Pl»?IN<X)*COS<YP) 

BY<IfJ» * Pl*C0S<X3)^SIN<Y) 
P(I,J)«P?»(SIN(X3)»SIN<Y2) )**2 
1 COMTINUF 

00 2 J*ltNP 

00 3 I»l»NP 

vxn«J) * 0.0 
VY(NJ) » 0.0 
? CONTINUE 
L*0 
507*0 • 

c BEGINNING OF THE ITERATION 

C T*SECONn(TI) 

CALL TEvpL 

r t*secono(ti)-t 

c 

WPITE(ft.20l0) L*SV2*SOTfOT 

WRITF<ft.2000) ( (VX(I*J» *J*l*Nl*M) 

WPITE (6,2000) ((VYdtJl t I*l «Ml *M) * J»l »Nl *M) 
WPITE(6,200n) (( Pd.J) f I*l,Nl.M) *J*l*Nl«M) 

3000 pobmato x.ftEin.a) 

Polo FORMAK ///.11X*»L *• * 16 ./,lix» »5V3 *• •EU',4t/t I IX» 
• t50T«* *«‘in.4./.llX. *• .E10.4.//) 

c WOiTE(6.inn T 

C 101 FOPmaT( ) X, 'CPU-TIMB** *F10.5« ' SEC») 

STOP lid 
FMO 


Here are the details of the main program: 

After an introductory statement of the variables and reservation of 
memory space for the fields BX, BY, VX, VY and P, the constants, 
such as the field size N (N*N = the number of inner points), the 
local step width, DX = 1/N, and (modified) time step, DT = At/Ax^, 
are assigned fixed values. Calculation is then made of the initial 
values from the flow function (lines 16 - 30). The explicit single 
step procedure is formulated in the sub- routine TEXPL and called in 
line 38. At the end of the main program there follow instructions 
for the printout of the results in a given format (L = the number 
of time steps, SVZ = the cutoff error and SDT = the sum of all L 
time steps). 

The sub- routine TEXPL listed below starts at line 12 by editing 
the X and y derivatives of the pressure in the right upper corner 
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of the square lattice (I,J) (see Figure 7). The ZJ term in line 29 
represents the first of the equations (4.3). After evaluation of 
the velocity components VX and VY and VB = ^ x the last three 
equations from (4.3) are edited numerically in lines 53 - 58 . 

The edge conditions are taken into account in lines 59 - 68. In 
the latter part of TEXPL the query is made as to whether a certain 
mean cutoff error has been reached. If this is the case, there is 
a return to the main program. Otherwise, the calculation in line 
10 of TEXPL is continued. 


0002 SU9»0UTTNf: TFXPL 
c 

0003 COMMON /MAIN/ «X,nY«VX»VY,P 

000<* common /coni/ N.Nl fN?»L*NllfNN,LL 

0005 common /CONR/ S0T«DT •DX?«DX,EPS«SV2 

0006 RFAL BXf62,62)*BY(62«62»*VX<62,6?)*YY<62.62)* 

• P(62*62> «DPX(6?«62) ,nPYC62«62) 

0007 PFAL VB(6?f6?) «ZJ(62*62) •PXM(62*62) •RYM(62«62) f 

• VX1 (62.62) «VY) (6?.-6?) .OPXl (^2,62) fOPYl (62t62) « 

• px( 62.62) ,PY(62.62) fPl (62.6?) 

C 

oooo DTlcO.S^’OT 

0009 OT2«0.0*'758*DT1 

0010 13 L»L*\ 

0011 S0T*snT.0T*n«? 

0012 DO 1 Jsl.Nl 

0013 DO 1 I«1.N1 

0014 PX(T.J)»P(1,J)*P(1.J*1) 

0015 Prn.J)»P(I,J)*P(I*l.J) 

0016 1 CONTINUF 


0017 

OOlfl 

0019 

0020 
0021 
0022 

0023 

0024 

0025 

0026 
0027 
002R 

0029 

0030 

0031 

0032 

0033 

0034 

0035 

0036 

0037 

0038 

0039 

0040 

0041 

0042 

0043 

0044 

0045 

0046 

0047 

0048 

0049 

0050 

0051 

0052 


on 2 J*l.Nl 

on 2 I»1.N1 

OPX(I*J)*PX(I*l tUl-RKd^J) 

OPY(I.J)«PY(If J*1)-PY(I*J) 

2 continue 

00 3 J»l»Nl 

DO 3 1*1. N1 

0xM(I.J)*BX(I*J)*BXtItJ»l) 

• 8vm(1,J)*0Y(I»J)*0V(1*1»J) 

3 CONTINUE 

00 4 Jsl.Nl 

no 4 Isl.Nl 

ZJ(I*J)* BY(T*l*J)-BY(lfJ)-0X(l.J*l)*0XCl*J) 

4 CONTINUE 

no 5 J»1*‘U 

no 5 

VXU.J)*(-OPXCIf J)-BYM(I,J)*ZJ(I*J))*0.5 

5 CONTINUF 

00 6 J*2«N 

DO 6 1*1. Nl 

VY ( I • j) a (-0PY < 1 1 J) ♦BXM( I , J) *Z J ( I • J» ) *0 .5 

6 CONTINUE 

DO <7 J*1.N1 

DO 7 I«l.Nl 

VB(I*J)»(VX(If J)*0YM(If J)-VY(I»J)*9XM(I,JM«0T1 

7 CONTINUE 

on 8 J»2.N1 

DO 8 1*1 *N1 

V»1 (I.J)aVX(I,J)^VX(l«J-l) 

OPXl <IfJ)*OPX{I»J)*OPX(ItJ-l) 

8 continue 

DO 9 J*l*Nl 

DO 9 I*2#N1 

Vvi (l,j>*VY(I.J)*VY{I-l*J) 

OPYl (I.J)*OPY(I»J)*OPY<I-l*J) 

Q continue 
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0053 

0054 

0055 


0056 

0057 

0058 

0059 

0060 
0061 
0062 

0063 

0064 

0065 

0066 
0067 
006H 


00 10 J«2*N1 

OC 10 l32fNl 


vYintj)-vvi<i»j-i))) 
-nT2*UVXl<I»J)*VXl(I-l»J)) 
•<OPXI(NJ)*OPXl(T-l*JH 
♦(VYl (NJ)*VVKT.J-in 
• (DPYin»j)»0PYi(i.J-im 
flX(I*J).'<BX<I»J)*(VB(I*J)-VP(If J*l) > 
RY<!»J)*flY(I,J)-<VB(I.J)-V8(I-l»JM 

10 continuf 

00 11 I»ltNl 

Rydfl) sqxn*2) 

Bv(I,N2)*BX(I«N1} 

BYdfl) aBY(2»I» 

9v(M2#naBY(NltI> 

PfI*N2) «PdfNl> 

P(N2»n *P<Nl#I) 

PdtU »Pdt2) 
p.d*n *p<2.n 

11 CONTI WE 


0069 

0070 

0071 

0072 

0073 

0074 

0075 

0077 

0078 

0079 


SV2*0. 

no ]? 

DO 1? 

SV2*SV2*A0s(VX(I.J) ) 

12 CONTINUE 
SV2*SV2^««J»J 

TP (SV2.GE.EPS) GOTO 13 
C ir (L.UT.LL)GOTO 13 

DT=2.«DT1 
RETURN 
END 


The program on a CRAY IS is very similar to the above FORTRAN program. 
Apart from the removal of the 'C in the seventh comment line (which 
is necessary even for certain serial machines), nothing at all changes 
in the main program. The sub-routine TEXPL too is compiled as it 
stands by the CRAY compiler and is automatically vectorised up to the 
DO loop 11 which determines the edge conditions for BX, BY and P. As 
there are no genuine recursions in this loop, vectorisation can be 
brought about by means of the instruction 

CDIR$ IVDEP. 

If the sub-routine were to be written rather more elegantly, for 
instance by splitting up loop 10 into several smaller DO loops, the 
CPU time required would be increased by up to 30%. In fact, the 
CRAY compiler optimises better the more complicated the vector ex- 
pressions in the innermost loop are. In this way the possibility 
of chainings, i.e. the linking together of several functional units 
for (say) (A + B)*C with the vectors A, B and C, are more effectively 
exploited. Finally, loop 12 is replaced by the CAL (CRAY Assembly 
Language) function SASUM. 

As is to be expected, the program is rather more complicated for the 
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CYBER 205. The main reason for this is that the data fields have 
to be made continuous to attain maximal performance from the vector 
computer. Vectors of length 62, for instance, are thus made up into 
vectors of length 3844. In this way, the performance for addition 
and multiplication of vectors - in our case for the whole of TEXPL - 
is improved by a factor of up to 2.5. In all essentials, the main 
program remains in CDC FORTRAN and in ordinary FORTRAN. 


SURROUTINP TEXPL 

COMMON /MAIN/ BX.PYtVX,v/YfPtOPX*OPY 

COMMON /CONI/ N*NlfN2fNU*N22,L 

COMMON /CONP/ S0T*SV2f0T,0X2t0X 

PEAL RX(fi2«A2) fBY(62*62.) •VX(62t6?) «VY<62,62) « 

2 P<62»62) «0PX(62«62) «nPY(62*62) 

OIMENSION VXM(62«62) tVYM(62i62) ,W0S1 ) « 1 1 (61 ) f 1 1 1 (61 ) , 

1 RXM(62«62) «RYM(62t62) ,VR(62,6?) «2J(62«62) « 

2 VXl (62f62)«VX2(62*62)*YX3(62t62)«VX4(62f62)f II (61),I11(61) 
descriptor RIT1D»RIT20,RIT3D*BIT40 

BIT BIT10*BITl(40(tO) 

BIT BIT2D*8IT2(4000) 

BIT BIT30f9IT3(4000) 

BIT BIT40*BIT4(4000) 

r 

N12«N1«N2 
NM12*N2*(N-1 ) 

N121* N12-1 

N3«N*N2-2 

NlX«NJ2-3 

0T1*0.5*0T 

DT2*0.0625*nTl 

ASSIBN BlTlOfBlTl (1IN12) 

RlT1D«Q8VMKO(Nl*N2IBITin) 
assign 9IT20«BIT2(1IN1X) 

RIT2D*Q«*VMKO(N-1,N2«BIT?0) 

ASSIGN BlT30fBlT3(ltN3) 

RlT30a0RVMK0(N*N?tBlT30) 
assign SIT40,BIT4(1IN121) 

BIT40«Q8VMK0(NlfN2<BIT4n) 

II(l»Nl)*08VINTL(2*N2»IT(nNl)) 

III (llNl)*08VINTL(l*N2lTIl(llNm 

I I ( lINU *08VINTL (N1 ,N2 1 T 1 ( 1 INI ) ) 

III (llNI)*QflVINTL(N2*N2ini(HNl)) 

13 L*L*1 

S0TaS0T*0T*0X2 

C 

VXl (1«1INI21)*P(2«1IN121)-P(lf 1IN121) 

* ♦P(2*2IN121)-P.(1*2IN121) 

0PX(!*ltNl21)*08VCTRL(VXl (1,1 IN121) «BIT10IDPX(1,1IN121) ) 

C 

VXl (1,1IN121)»P(1,2IN121)-P(1,1IN121) 

* ♦P(2,2fN121)-P(2,llNl?l) 

OPY(l*llNl21)*Q8VCTRL(VXl(l,llNl21),0ITlDIOPY(l,llN12ln 

C 

VXl (1»1IN12)*BY(2*IIN12)-BY(1,1IN12)-3X(1,2IN12)*BX(1,HN12) 
ZJ.(1»IIN12)»(38VCTRL(VX1 (1,IIN12) ,BIT10IZJ(1 »1 IN12) ) 

r. 

VXl (1,1IN12)*(-OPY(1*1INJ2)*0X(1,IIN12)* 

* RX(l,2IN12n*ZJ(l*llNl2)*0.S 
VY(1,2«NM12)*QRVCTRL(VX1 (l,2INM12),0ITlDIVY(l,2lNMl2n 

C 

VXI(2,1IN1X»*(-OPX(2,IIN1X)-0V(2,IINIX)-BV(3,HNIX))*ZJ(2»1INI 

•X»*0.5 

VX(2,1IN1X)m08VCTRL(VX1(2,1IN1X) ,0IT2OIVX(2«1 INIX) ) 
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VXl (2.2IN3)«VX(2.?|M3) 

* ♦vx(?»nN3)*vxn ♦2»N3>*vxn.i«N3) 

VXM(?.?IN3)*0RVCTRUVX1 (?,2lN3) .RTTini VXM (2t 2* N31 > 
VXl<2f2IN3)*VY(?,?»N3) 

* ♦VY(2»llM3>*VYn .2»N3)*VY(1 .l»N3) 

VYM(?,?IN3)«0RVCTRL(VX1 r?.2»N3> »RTT30»VYM(2*25N3) > 

nxHa.nNi2n«Rx(itiiNi?n*flxn.2iNi2i» 

RVMn*nNl21 )*RY(1 f HNl?n*RY(?.11Nl?l ) 

VXl (If l«Nl21)*<VXn.UN??n«RYHn *1 IN121 ) 

* -VY(lf HN121)*RXH(lf 1 «Nl?n >*nTl 

VB(1»HN121 )«OBVCTRL (VXl <1 ,ltNl?l ) .RT T4D » VR (I t UN121) ) 

VXl (^f^^N3).nPX(?f^»^J3)♦f)PX(2f U»<3)*npxn f??N3)*DPX(lf HN3) 
VX2(2f2IN3)*nPY(?f2*N3>*0PY(?fll*^3)*nPYnf2IN3)*DPY(lf I KM3) 
VX3(2f ^^M3)*VX(^f^^N3)♦^^X(^f nN3)-VX(lf2»KO)-VX(lfllN3) 

* ♦VY(2f2IN3>-VY(2fHN3)*VY(lf2»N3)-VY(lf HN3) 
VX4(2f?IN3)*P(?f2IN3)*n .-0Tl*VX3(2f2IN3) 1-DT2* (VXM(2f 2|N3) 

**VXl (2f2IN3)*VYM(2f2IN3>*VX2(2f2*N3) > 
P(2,?IM3)«(5RVCTRL(VX4(2.?IM3» fBITlOIP (2f 2IKI3) ) 

VXl (2f2IN3)*RX(2f2lN3>* (VB(2f 21^13) -VH(2fl IN3) ) 
RX<?f2»N3)*OflVCTRL(VXl (?f?IN3) *BIT3DI0X (2,2IN3) ) 

VXl (2f2IN3)»BY(2f2|N3)-(VR(2f??N3)-VR<lf2lN3) ) 
PY(2f2?M3)*CiRVCTRL<VXl (2t 2IN31 1 BIT30IBY I2f 2IN3) ) 

W<MNl>*QBVGATHR(BYOf l>Nl?)f II (UNlMlX(llNl) ) 

BY 1 1 f 1 INI 2) «(JBVSCATP ( W ( UNU t II 1 ( UNl MRY (I f 1 1 N12) ) 
W<llNl)«ORVrfATHR(P(lfllN12>fII lllMl ) UMIINI) ) 

P(lf 1IN12)«0BVSCATR(W(1IN1 If III (1 INI) IPdfl IN12H 
Wtl INl)«OBVBATHP(BY(lf HN12) f II (1 INI) ItmiNl ) ) 
RY(lfllN12)«OBVSCATP<>miNl)fIUniN])IRY{lfllN12)) 

w(iiNi)sonvrfATHR(pnfiiMi2)fii (iind iwo ind ) 

PUfllNl?)>aBVSCATR(W(liNl) fill niNl) IP(lfHN12)) 
BX(lfllNl)«RX(l«2lNl ) 

BX(lfN2INl)«RXnfNllNl) 

PnfN2INl)«P(lfNllNl) 

Pdf llNl)»P(lf2INl) 

VXl d«llN22)«VABS(VXdf 1 IN22) IVXl df 1 IN22) ) 
BV2MC!3SSUM(VXldfllN22) ) 

SV2«SV2/N11 

IF(L.LT.IOOO) GOTO l3 

RETURN 

END 



For this reason we shall concentrate here on the sub-routine TEXPL. 

The BIT vectors mentioned in chapter III are an essential feature of 
this CDC FORTRAN program. They are generated by means of the function 

Q8VMK0 

if they begin with a 1 (One) and by means of the function 

QSVMKZ 

if they begin with a 0 (Zero). Although so-called implicit descrip- 
tors are used for the BIT vectors, all other vectors are formulated 
explicitly. This means, for instance, that 

VX1(1,1:N121) 

represents that part of the vector VXl which is stored in the reserved 
field VXl of the memory, begins with the element VX1(1,1) and contains 
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N121 elements (stored by columns). VXl, VX2, VX3 and VX4 are dummy 
fields in which the results of vector operations are stored at the 
outset in continuous fashion. Eventual rearrangement of the storage 
under the control of a BIT vector with the aid of the function 

Q8VCTRL (« Q8 vector control) 

ensures that the relevant edge values are not overwritten with any 
"false edge operation". 

In this way DPX, DPY, ZJ ^ x Bi), VY, VX, VXM (the mean value of 
VX), VYM and VB (“ V x ^) are calculated before the last three 
equations of (4.3) are formulated in CDC FORTRAN. 

One has certain problems associated with the determination of the 
edge values on the upper and lower edges. Although in principle 
the procedure outlined here with Q8VMK0 and Q8VCTRL functions as 
indicated, from each of the N1*N1 computer operations performed now 
only N1 results (N1 = N + 1) are required to store the second line 
of BY in the first line, using for instance 

DO 11 I = 1,N1 
11 BY (1,1) = BY (2,1) 

It is better in such cases to collect the BY(2,I) terms together as 
in the serial LOOP with the function 

Q8VGATHR 

and to distribute them in the first line of BY with the function 

Q8VSCATR. 

For this purpose an index list is needed which in our case will be 
the vector 

II(1;N1) = { 2,N2+2,2-N2+2,.,N*N2+2 } 
having N1 elements (N2 = N + 2). This is produced by the function 

Q8VINTL(2,N2;II(1,N1). 

The GATHER function collects together all the elements in the second 
line of BY and stores them in the dummy field W. The index list for 
the elements in the first line of BY is thus given by: 

II1(1;N ) = { 1,N2+1,2-N2+1,...,N*N2+1 } 

which has N1 elements and is constructed by means of 
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Q8VINTL(1,N2,II1(1;N1)) . 

The elements of W are now stored in the first line of BY under the 
control of the index vector III by Q8VSCATR, 


In an analogous way, 

the second line of P is stored in the first line of P; 
the Nlth line of BY is stored in the N2th line of BY; and 
the Nlth line of P is stored in the N2th line of P. 


Finally, the absolute magnitudes of the components of VX are written 
on VXl and to carry out the precision query the sum of all N2*N2 
elements of VXl is formed by use of 

Q8SSUM . 


A major problem confronting particularly those inexperienced with the 
CYBER 205 is the rearrangement of a two or three dimensional data 
field into a one dimensional field, i.e. a vector. It is suggested 
that the user of a DAP first familiarise himself with three dimen- 
sional fields which can reasonably be stored in the 4096-bit memory 
of the 64*64 processors. For this reason the two dimensional MHD 
program here is more or less ideal for the DAP. 


In this instance too, the main program is practically unaltered. Only 
the “C on the seventh comment card is removed. Since the ICL 2980 
at Queen Mary College in London is a relatively slow host computer, 
the calculation of the initial values is carried out in the DAP part 
which is constructed in the following way: 


1 mma sukoutznb tekpl 3 

2 CX3MM3N /HAIN/ VX,VY,P 

3 CXMCM /OOKI/ NX,MX1,NX2,L,NX11 

4 CXM40N /CXXJR/ SDr,SV2,DT,DX2,DX,Dri ,DT2 

5 O0M40N /ANFA/ BX(,) ,BY{,) ,DPX(,) ,DPY(,) ,DPX1 (,) ,DPY1 (,) , 

6 * VX1(,),VY1{,),BXM(,),B»I(,),VB(,),2J(,), 

7 * PX(,),PY(,) • 

8 REAL VX(,),V3f(,),P(,) 

9 REAL SCO',SV2,DT, DX2,DX, DTI, DT2 

10 I/X3ICAL IMAT1(,),IMAT2(,),lMAr3(,),IMAT4(,),IMAT5(,),IJMAT6(,^ 

11 C 

12 C KONVEPTIERUNG VDN 2900 ZN DAP 

13 C 

14 CALL 0CWVFS1(NX,5) 

15 CALL OCNVFSE(SDT,5) 

16 CALL OCXIVP«E(VX) 

17 CALL CONVEME(VY) 

18 CALL OONVP»E(P) 

19 C 
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20 C AUEBAU DE3« LOdSCHEN MMTBIZEN 

21 C .^RUE. FUER '.CBESZET. 

22 C .FALSE. FUER .AUSSENBAUH * RAND 

23 C 

24 ZMKri«R0N5(1,NX1) .AND.0QLS(1,NX1) 

25 IMAX2-F0HS(2,MX).AHD.CX1LS(1,NX1) 

26 IMAT3>RaHS(1,NXl).AMD.CX]IS(2rN}<) 

27 IMAT4«R0HS(2,NX1).AND.0CaS(2,NXl) 

2B U«T5>ROHSn,NX1).AND.OQIIS(2rNX1) 

29 XMA3%>R0HS(2,NX1).AND.CQIS(1,NX1) 

30 C 

31 C 

32 CALL ANEB 

33 C 

34 11 

35 SDT«SOT+Dr*DX2 

36 C 

37 , PXdMKTI) »P+P(,+) 

38 PY(U«ri) * P+P(-)-,) 

39 DPXdMATI)*: PX(+f)-PX 

40 DPy(iMATi)« py(,+)-py 

41 C 

42 BXM(X«Kri)* BX-)>BX(,-^) 

43 B2M(IMATD* EY4BV(f,) 

44 ZJ0MAT1) *, W(+,)-BY+BX-BX(,+) 

45 C 

46 VX(t«AT2) « (DPX+BKM*ZJ)*(-0.5) 

47 Vy{U4A!r3) - (DPy-BXM*ZJ)*(-0.5) 

48 VBdNATI) « (VX*B»J-V5r*BXM)*OT1 

49 C 

50 VX1 (IMAaS) « VX+VX{ ,-) 

51 WKIHAT6)* W+W(-,) 

52 DPX1(U«r5) « OPX-^DFK(,-) 

53 DEY1 (I«AT6) = DPY+DPy(-,) 

54 C 

55 P(IMAT4) » P*(1.HW1*(VX1-VX1(-,)+W1--Vyl (,-))) 

56 * -Dr2*((VXl+VX1(-,))*(DPX1+DPX1(-,)) 

57 * +(Vyi+V5ri(,-))*(DPYl+DEY1 (,-))) 

58 BX(IWA!r4)» BX+(VB^(,-)) 

59 BY(Umr4)« BY-(V&-VB(-,)) 

60 C 

61 BX(,1) > GK.(,2) 

62 B0C(,I«2) > BX(,NX1) 

63 BY(M « BY(2,) 

64 BX(NX2,) >BV(NX1,) 

65 P(»NX2) « P(,NX1) 

66 P(NX2,) - P(NX1J 

67 P(,1) » P(,2) 

68 P{1,) - P(2,) 

69 C 

70 ZJ * ABS(VX) 

<71 SV2 » SOM{ZJ) 

72 SV2 » SV2/NX11 

73 IF(L.Lr.l) GOTO 11 

74 C 

75 CAUi 0CNVFSI(1K,5) 

76 GAU. QGNVSFE:(3I7F,5) 

77 CALL OCMVHFE(VX) 

78 CALL OONVMFB(VY) 

79 CALL GC»VHFE(P) 

80 C 

81 RESBFN 

82 END 

83 C 
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84 SUBRJUTINE ANEB 

85 C 

86 OSMfX) /HUN/ VX,VY,P 

87 OCmH /ANFA/ BX(,) ,OFX(,) ,DFY(,) ,DPX1(,) ,DEY1 0) , 

88 * VX1(,),V5n(,),aXM(,),BYM(,),VB(,),ZJ(,», 

89 * PX(,),P5f(,) 

90 CCHCN /OOMR/ SZ7r,SV2,DI,DX2,DX 

91 INTEGER V() ,PLACE 

92 REAL’V1(),V2(),VX(,),P(,),VY(,) 

93 C 

94 V»0 

95 PI»3. 1415927 

96 PSI0>0. 02533 

97 PSI1»PSI0*PI 

98 C 

99 c imnax v mti vcn o bis 63 

100 c 

101 PIACEi^l 

102 DO 1 K=1,6 

103 V (ALT (PIAIZE) ) "V+PIACE 

104 1 PIACE«PLACE*2 

105 C 

106 VX»0. 

107 VY=0. 

108 PX=0. 

109 py=o. 

110 DPX«0. 

111 DPY:=--0. 

112 VX1=0. 

113 vyi=o. 

114 DPX1=0. 

115 DPyi«=0. 

116 BXM =0. 

117- BXM =0. 

118 VB *=0. 

1,19 ZJ «0. 

120 C 

121 V1»pr*(£HX)AT(V)*nX-1.) 

122 V2»V1-0.5*DX*PI 

123 BX«-PSI1*MATC(SlN(V1))*MrffR(a3S(V2)) 

124 BY« PSI1*MA1C(0C3S(V2))*MKIR(SIN(V1)) 

125 P ■0.2SnBI0*MATC(SIN(V2)**2)*MATR(S2N(V2)**2) 

126 C 

127 RETURN 

128 END 


I 

Because the DAP memory is a part of the host memory, the calculated 
or reserved words from the main program are already in the DAP though 
in the incoi^rect memory mode for the DAP. Accordingly, they are 
converted at the outset as described in Chapter III. Logical matrices 
are then constructed in analogy to the BIT vectors on the CYBER 205. 
This too has been discussed in some detail earlier on. The initial 
conditions are called up from the DAP sub-routine ANFB which serves 
to construct a vector V having elements from 0 to 63 by means of the 
built-in function ALT, Parallel processing is thus made possible in 
lines 123 to 125 of the equations for BX, BY and P. Here, for ins- 
tance, the matrix 

•j 

^ MATC(Z) 

) 

I 

(C stands for column) signifies that the first element of the vector 

I 

i 


U 
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Z occupies the whole of the first line MATC(Z), the second element 
the second line, and so on. (The same holds for MATR as far as the 
columns are concerned.) 

In lines 37 to 59 the equations are then evaluated, initially on the 
right hand side of the equality sign over the whole 64*64 field. 

Here, for instance, P{+») indicates that the whole P field moves up 
one line. The last line is filled with zeros and the first is dis- 
carded. The results are stored only when a TRUE appears in the 
logical matrix LMAT. 

After insertion of the edge values for BX, BY and P, ABS and SUM 
follow to determine the mean error. To conclude, those magnitudes 
which are to be printed out in the main program are converted again. 

5 . Summary and Overvi ew 

The investigations described in Chapters 3 and 4 have demonstrated 
that mastery of even apparently straightforward problems with new 
vector and parallel computers involves novel approaches. It is 
often not cost-effective to use traditional FORTRAN programs without 
any modification on these new computing systems. The inefficiency 
which can arise by doing this is illustrated by a comparison of the 
computation times required for ten very different programs in fluid 
mechanics {44}. In such a test, CRAY 1 and CYBER 205 machines were 
faster than the AMDAHL 470/V6 by the following factors; 

CRAY 1 CYBER 205 

serial 12.5 5.7 

vector 25.9 25.8 

Serial means here that the programs used underwent scalar tests with 
autovectorisers in operation. The extent to which individual programs 
can deviate from these mean values is shown by the results for the 
very readily vectorised and parallelised MHD algorithm discussed in 


the previous chapter. 




CRAY 1 

CYBER 205 

serial 

44.7 

10.8 

vector 

44.7 

49.7 


The rapid serial times for MHD on the CRAY 1 come about largely as a 
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result of the relatively good autovectori ser of the CRAY compiler 
which has the function of recognising and vectorising inner DO loops 
capable of vector! sati on. This task is much easier than that of 
the CYBER autovectori ser which is required to arrange given data 
structures to produce continuous data fields which are then able 
to flow through pipes as vector flows of maximum possible length. 
There is no comparable analogy for the DAP since an ordinary FORTRAN 
program cannot be run on a DAP without meshes. Moreover, there is 
no "autoparalleliser” available yet. It is the opinion of the 
author that the DAP occupies the position indicated in a list of 
well-known von Neumann computers. An MHD run requiring one minute 
of computing time on an AMDAHL 470/ V6 would require on the following 
computers: 


ICL 2960 

30 

min 

* 

UNIVAC 1106 

15 

min 


CDC 6400 

10 

min 


IBM 370-158 

8 

min 

* 

ICL 2980 

5 

min 


CVBEB 173 

4 

min 


AMDAHL 470/V6 

1 

min 

* 

CYBER 175 

45 

sec 

* 

CDC 7600 

20 

sec 

* 

STAR 100 

10 

sec 

*. 

CYBER 203 

6 

sec 

* 

DAP 

5 

sec 

* 

CrAY-1S 

2 

sec 

* 

CYBER-205 

2 

sec 

* 


The starred times are based on the experiences and results of the 
DFVLR group {44}. 

The performance curves for von Neumann, vector and parallel compu- 
ters are also interesting in the case of our problem. Once again 
the computers are the AMDAHL 470/ V6, CRAY IS. CYBER 205 and the DAP, 
the problem is the MHD one and N is the number of lattice points in 
a dimension (the vector length on the CYBER 205 will then be equal 
to (N-2)*N). 

The time steps for the CRAY IS and DAP curves arise because of the 
restriction on the content of the vector register to 64 elements or 
the restriction of the processors to 64*64. Similarly, for the 
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CYBER 205 vectors will need to be re-scheduled if they are longer 
than 2^® = 65536 elements, the content of a large page. This would 
happen in our case for N = 256. 


0.03 -j 

sec 
0.02 H 


2 0.01 
0 ) 


0.00 




AMDAHL 64Bi1s 

AMDAHL 32Bif$ 

OAP 32Bi1s 

CRAY-1 64Bi1s 

CYBER 20S 64BiU 
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Figure 8 


The following picture emerges when the results for individual mach 
ines operating the MHD program are compared: 


N 

10 

20 

30 

40 

50 

60 

CRAY : AMD. 

18.7 

25.2 

30.9 

34.6 

36.8 

38.2 

CYBER : CRAY 

1 .28 

1.59 

1 .55 

1 .52 

1.49 

1 .46 

CRAY : DAP 

46.8 

18.0 

9.4 

6.0 

4.0 

3.0 

CYBER : AMD. 

23.9 

40.0 

48.0 

52.6 

56.4 

55.7 

CYBER : DAP 

60.6 

27.9 

14.7 

9.0 

6.0 

4.3 

DAP : AMD. 

0.4 

1.4 

3.3 

5.8 

9.1 

12.9 


It should be borne in mind here that the computations were carried 
out on the DAP using 32-bit arithmetic and on all other machines 
using 64-bit arithmetic. 

To conclude we shall risk taking a view of the short-term future. 
In addition to the implementation of 4K-bit chips and 16K-bit ECL 
chips, CRAY Research, CDC and ICL are now concentrating on refine- 
ments in their architectures. In 1981 CRAY extended its already 
rapid I/O buffer memory up to a maximum of 8 million words. In 
addition, the transfer rate from this memory to the main memory 
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has been increased by up to 2*850 million bits per second, an impor- 
tant development for possible faster vector operations. 

Around 1984 a CRAY 2 machine is expected which should be some 5 times 
faster than the CRAY IS. This is to be accomplished by means of new 
switching circuitry which has been developed in CRAY'S laboratories. 

Another conceivable improvement would be the attenuation of the so- 
called "refill" which the CRAY IS is subject to after each 64 ele- 
ments when the registers are emptied and filled up again. The 
underlying cause is not the small size of the vector registers, as 
is often assumed, but the fact that only one current of vector 
elements can flow from/ to the registers from/ to the memory. Enlar- 
gement of the registers from 64 to 128 words for vector addition and 
multiplication would result in an increase of around 10 % to just 25 
MFLOPS. If it were possible to have two vector flows between the 
memory and vector registers, some 38 MFLOPS could be attained for 
addition and multiplication. This, however, would make heavy 
demands on the hardware. 

One of the most interesting developments in the software sector is 
certainly the production of a Pascal compiler at Manchester Univer- 
sity for the CRAY mainframe. This compiler should be able to pro- 
duce vectorised code. 

The CYBER 2XX is expected on the market around 1986. With a station 
time of 8 ns (CYBER 205 has 20 ns) and 8 vector pipes (CYBER 205 has 
a maximum of 4), the scalar processor should be 2.5 times faster and 
the vector processor 5 times faster. The use of 4K-bit chips and 
16K-bit ECL chips should extend the memory capacity of the main 
memory to 8 million 64-bit words. 

It can also be assumed that the CYBER software will be considerably 
enhanced in the next few years. Among other things, the autovecto- 
riser will certainly become more efficient. A simplification of 
the CDC vector FORTRAN would also be justified, e.g. abbreviation 
of the Q8 instructions, simplification of the BIT vector construc- 
tion, simplified statements (such as AD(BITD) = BD+CD), the intro- 
duction of certain strictly implemented standard bit vectors, etc. 
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ICL is primarily concerned with increasing the DAP memory from 2 to 
8 million bytes 2 million 32-bit words) by replacing the 4K chips 
in the individual processors by 16K chips. A 64-bit arithmetic 
may then be feasible (roughly double the precision) as this is 
urgently required for many scientific and technical problems. An 
improved performance of the individual processors can also be expe- 
cted (addition now 0.007 MFLOPS), especially after the beginning of 
October 1981 when an announced joint venture between ICL and 
Fujitsu will give ICL access to very advanced Japanese micropro- 
cessor technology. We can also expect the relatively slow host 
computer ICL 2980 to be replaced by a much faster machine. 

The extension of the 64*64 DAP to a 128*128 DAP should present no 
problem from the hardware standpoint, though for the end user more 
complicated programs will mean that full exploitation of the 
128*128 = 16384 processors will be even more difficult. The four- 
fold increase in cost (at least) should be counterbalanced by a 
performance improvement in the DAP part of around 2.5 times {45}. 
According to Minsky's hypothesis {89}, which states that multipli- 
cation of processors by a factor of p results only in an increase 
in performance of In p, the improvement in the performance of the 
host and DAP parts together should be only 1.4 for the 128*128 
configuration. Initially, at least, there will be little improve- 
ment on the 64*64 configuration. 

After our editorial deadline, the sub-routine TEXPL was revised 
using the latest CDC FORTRAN. In particular, replacement of the 
Q8VCTRL statements by so-called WHERE blocks resulted in a further 
reduction of the computing time by about 20%. Thus, the factor 
for the comparison of AMDAHL: CYBER 205 is now improved from 55.7 
to 69.6. This corresponds to a speed of around 100 MFLOPS for the 
processing of the production program TEXPL by CYBER 205. 
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